Using RNN to model syllabic structure discrimination

Gonzalo Garcia-Castro
Nuria Sebastian-Galles
Chiara Santolin

PH3 Meeting

2024-04-30

Santolin et al. (2024)

Head-turn Preference Procedure

Familiarization phase

CVC CCV
sam sma
gel gle
pus psu
dor dro
sen sne

Test phase

CVC CCV
sap spa
kos kso

Santolin et al. (2024)

  • Infants encode and generalise CVC and CCV syllabic structures only when familiarized with CVC
  • Generalisation occurs regardless of phonetic information

Aims

  • Are CCV syllables more difficult to process than CVC syllables?
  • Have infants accumulated more exposure to the more frequent CVC than to CCV?
  • Simulate the experimental outcomes using Recurrent Neural Networks (RNNs) as a testing bench

Neural Networks

  • A collection of simple regression models (nodes) stacked in layers
  • Receive input, generate output
  • Some nodes inform other nodes via connections whose relative importance is determined by weights
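As a toy illustration (our own notation, not the slides'), one node is just a weighted sum of its inputs passed through a nonlinearity:

```python
import numpy as np

def node(inputs, weights, bias):
    """One node: a weighted sum of its inputs, squashed by a nonlinearity."""
    return np.tanh(inputs @ weights + bias)

x = np.array([0.5, -1.0, 2.0])   # outputs of three upstream nodes
w = np.array([0.1, 0.4, -0.2])   # connection weights (their relative importance)
activation = node(x, w, bias=0.0)
```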

Recurrent Neural Networks (RNN)

  • Model time series that unfold over an arbitrary number of time steps
  • Receive the additional input of their own previous state
  • Long used in speech recognition, text processing, and speech generation software (largely superseded by transformers in recent systems)
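The "previous state" input above can be written as a recurrence (notation ours):

```latex
\[ h_t = f\left( W_x x_t + W_h h_{t-1} + b \right) \]
```

where \(x_t\) is the input at time step \(t\), \(h_{t-1}\) is the layer's own state from the previous step, and \(W_x\), \(W_h\), \(b\) are learned weights and biases.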

Proof of concept

Supervised audio classification task

  • Starting small: CV vs. VC diphones
  • Keep the model as simple (i.e., interpretable) as possible

Magnuson et al. (2020)

Audio processing

  • 7,000 audio recordings: 700 unique diphones \(\times\) 10 speakers
    • 3,500 consonant-vowel (CV)
    • 3,500 vowel-consonant (VC)
  • Amplitude envelope (Deloche, Bonnasse-Gahot, and Gervain 2024)
  • Normalized amplitude and duration (downsampling) across audios
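A minimal sketch of this preprocessing, assuming a crude rectify-and-smooth envelope rather than the Deloche, Bonnasse-Gahot, and Gervain (2024) method, with duration equalized by resampling and amplitude peak-normalized:

```python
import numpy as np

def amplitude_envelope(signal, win=128):
    """Crude envelope: rectify the waveform, then smooth with a moving average."""
    rectified = np.abs(signal)
    kernel = np.ones(win) / win
    return np.convolve(rectified, kernel, mode="same")

def normalize(env, n_samples=1000):
    """Equalize duration (resample to n_samples) and peak amplitude (max = 1)."""
    t_old = np.linspace(0.0, 1.0, len(env))
    t_new = np.linspace(0.0, 1.0, n_samples)
    resampled = np.interp(t_new, t_old, env)
    return resampled / resampled.max()

# Stand-in for one diphone recording (the real input would be a loaded waveform)
audio = np.random.default_rng(1).normal(size=4410)
env = normalize(amplitude_envelope(audio))
```

The window length, target duration, and normalization scheme here are illustrative choices, not the values used in the actual pipeline.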

RNN structure

  • 1 input node (receiving one audio sample in each time step)
  • 2 \(\times\) 2 recurrent nodes (2 recurrent layers, 2 nodes each)
  • 1 output node (\(\sigma\) activation function), outputs a probability \(\in [0, 1]\)
    • \(\approx 1\) more likely CV, \(\approx 0\) more likely VC
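A pure-NumPy sketch of one forward pass through this architecture (random, untrained weights; names are ours), reading the envelope one sample per time step:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(x, W_in, W_rec, b):
    """Run a tanh recurrent layer over a sequence x of shape (T, n_in)."""
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)  # new state from input + old state
        outputs.append(h)
    return np.array(outputs)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy envelope: 50 samples, 1 input node per time step
x = rng.normal(size=(50, 1))

# Layer 1: 2 recurrent nodes reading the 1-d input
h1 = rnn_layer(x, rng.normal(size=(2, 1)), rng.normal(size=(2, 2)), np.zeros(2))
# Layer 2: 2 recurrent nodes reading layer 1's output
h2 = rnn_layer(h1, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)), np.zeros(2))

# Output node: sigmoid over the final state -> probability of CV
w_out, b_out = rng.normal(size=2), 0.0
p_cv = sigmoid(w_out @ h2[-1] + b_out)
```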

Model training

  • Optimizer: Adam (\(\alpha = 0.001\))
  • Binary cross-entropy loss function
  • 30 epochs (early stopping at 95% accuracy)
  • Batch size: 16
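The binary cross-entropy loss above can be sketched as follows (labels CV = 1, VC = 0; the example predictions are made up):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips predictions away from log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
confident = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.95]))
chance = binary_cross_entropy(y_true, np.array([0.5, 0.5, 0.5, 0.5]))
# Confident correct predictions yield a lower loss than chance-level ones;
# early stopping would end training once accuracy reaches the 95% threshold.
```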

TensorFlow + Keras

(Still tweaking things around.)
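A minimal Keras sketch of the model as described on the previous slides (our reconstruction, not the authors' code; `SimpleRNN` layers stand in for whatever recurrent cell is actually used):

```python
import tensorflow as tf

# 1-d input sequence -> 2 stacked recurrent layers of 2 nodes each
# -> 1 sigmoid output node giving the probability of CV
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),  # one envelope sample per time step
    tf.keras.layers.SimpleRNN(2, return_sequences=True),
    tf.keras.layers.SimpleRNN(2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Stopping at 95% accuracy would need a small custom callback; Keras's built-in `EarlyStopping` monitors improvement rather than an absolute threshold.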

Results

Preliminary results
Future steps

  • Use spectrograms instead of envelope (e.g., Magnuson et al. 2020)
  • Unsupervised learning: replace output node with output layer (generation of spectrograms)
    • Encoder-decoder: What does the model think stereotypical CV or VC spectrograms look like?
  • More complex syllable structures: CVC, CCV
  • Take a look at embeddings

Future steps

Is the model better at classifying CVCs than CCVs? (Santolin et al. 2024)

  • Yes: complexity of the speech signal? (e.g., CC cluster)
  • No: infants have accumulated more experience with CVC (more frequent) than CCV?
    • If so, can we reproduce the results by manipulating the frequency of each syllabic structure in the model’s input?

Discussion

  • Getting there
  • The usefulness of a working model goes beyond this project
  • More technical difficulties than anticipated
  • Patience needed, but little time (SIDE PROJECT)

References

Bertoncini, Josiane, Caroline Floccia, Thierry Nazzi, and Jacques Mehler. 1995. “Morae and Syllables: Rhythmical Basis of Speech Representations in Neonates.” Language and Speech 38 (4): 311–29.
Bijeljac-Babic, Ranka, Josiane Bertoncini, and Jacques Mehler. 1993. “How Do 4-Day-Old Infants Categorize Multisyllabic Utterances?” Developmental Psychology 29 (4): 711.
Deloche, François, Laurent Bonnasse-Gahot, and Judit Gervain. 2024. “Acoustic Characterization of Speech Rhythm: Going Beyond Metrics with Recurrent Neural Networks.” arXiv Preprint arXiv:2401.14416.
Magnuson, James S, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, et al. 2020. “EARSHOT: A Minimal Neural Network Model of Incremental Human Speech Recognition.” Cognitive Science 44 (4): e12823.
Santolin, Chiara, Konstantina Zacharaki, Juan Manuel Toro, and Nuria Sebastian-Galles. 2024. “Abstract Processing of Syllabic Structures in Early Infancy.” Cognition 244: 105663.